
    SAMIE-LSQ: set-associative multiple-instruction entry load/store queue

    The load/store queue (LSQ) is one of the most complex parts of contemporary processors. Its latency is critical for processor performance and it is usually one of the processor hotspots. This paper presents a highly banked, set-associative, multiple-instruction-entry LSQ (SAMIE-LSQ) that achieves high performance with small energy requirements. The SAMIE-LSQ classifies memory instructions (loads and stores) based on the address to be accessed and groups instructions accessing the same cache line into the same entry. Our approach relies on the fact that many in-flight memory instructions access the same cache lines, so each SAMIE-LSQ entry has space for several memory instructions accessing the same cache line. This arrangement has a number of advantages. First, it significantly reduces the address-comparison activity needed for memory disambiguation, since there are fewer addresses to compare. It also reduces the activity in the data TLB and in the cache tag and data arrays: once one of the instructions in an entry performs its access, the cache line location and address translation are cached in the corresponding SAMIE-LSQ entry, so instructions sharing the entry can reuse the translation, avoid the tag check and read the data directly from the right cache way without probing the others. Moreover, the delay of the proposed scheme is lower than that of a conventional LSQ. We show that the SAMIE-LSQ saves 82% of the dynamic energy of the load/store queue, 42% of the L1 data cache and 73% of the data TLB, with a negligible impact on performance (0.6%).
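
    As an illustration of the grouping idea, the following Python sketch models an LSQ whose entries are keyed by cache-line address and hold a few instruction slots; the first access fills in the translation and cache way that later sharers reuse. The line size, slot count and class names are illustrative assumptions, and the set-associative organisation and banking of the real design are omitted.

```python
# Simplified sketch of the SAMIE-LSQ grouping idea (not the paper's hardware design).
LINE_SIZE = 64          # bytes per cache line (assumed)
SLOTS_PER_ENTRY = 4     # memory instructions sharing one entry (assumed)

class SamieEntry:
    def __init__(self, line_addr):
        self.line_addr = line_addr   # cache-line-aligned address (the entry tag)
        self.slots = []              # in-flight loads/stores mapped to this line
        self.translation = None      # physical line address, filled on first access
        self.way = None              # cache way holding the line, filled on first access

class SamieLSQ:
    def __init__(self):
        self.entries = {}            # line address -> SamieEntry

    def insert(self, inst_id, addr):
        """Allocate the instruction into the entry of its cache line."""
        line = addr // LINE_SIZE * LINE_SIZE
        entry = self.entries.setdefault(line, SamieEntry(line))
        if len(entry.slots) >= SLOTS_PER_ENTRY:
            return False             # entry full: a real design would stall or reallocate
        entry.slots.append(inst_id)
        return True

    def disambiguate(self, addr):
        """One comparison per cache line, not one per in-flight instruction."""
        line = addr // LINE_SIZE * LINE_SIZE
        return self.entries.get(line)

    def first_access(self, addr, translation, way):
        """Cache the translation and way so sharers skip the TLB and tag check."""
        entry = self.disambiguate(addr)
        if entry is not None and entry.translation is None:
            entry.translation, entry.way = translation, way
```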

    Inherently workload-balanced clustered microarchitecture

    The performance of clustered microarchitectures relies on steering schemes that try to find the best trade-off between workload balance and inter-cluster communication penalties. In previously proposed clustered processors, reducing communication penalties and balancing the workload are opposing targets, since improving one usually implies worsening the other. In this paper we propose a new clustered microarchitecture that can minimize communication penalties without compromising workload balance. The key idea is to arrange the clusters in a ring topology in such a way that the results of one cluster can be forwarded to the neighbouring cluster with a very short latency. Communication penalties are thus kept low when the producer of a value and its consumer are placed in adjacent clusters, which also favours workload balance. The proposed microarchitecture is shown to outperform a state-of-the-art clustered processor: for instance, for an 8-cluster configuration with just one fully pipelined unidirectional bus, a 15% speedup is achieved on average for FP programs.
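
    A minimal sketch of a ring-aware steering heuristic in this spirit follows; the cluster count, the fallback policy for independent instructions and the function names are assumptions for illustration, not the exact scheme evaluated in the paper.

```python
# Illustrative steering for a ring of clusters: a dependent instruction is steered to
# the ring successor of its producer's cluster (where the forwarded result arrives
# with minimal latency); independent instructions go to the least-loaded cluster.
NUM_CLUSTERS = 8  # assumed configuration

def steer(producer_clusters, load):
    """producer_clusters: clusters holding the in-flight producers of the source operands.
    load: pending-instruction count per cluster (mutated to reflect the steering)."""
    if producer_clusters:
        # Follow the ring: the neighbour sees the forwarded result one short hop later.
        target = (max(producer_clusters) + 1) % NUM_CLUSTERS
    else:
        # No dependences to honour: pick the least-loaded cluster to balance the workload.
        target = min(range(NUM_CLUSTERS), key=lambda c: load[c])
    load[target] += 1
    return target
```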

    Low-complexity distributed issue queue

    As technology evolves, power density increases significantly and cooling systems become more complex and expensive. The issue logic is one of the processor hotspots and, at the same time, its latency is crucial for processor performance. We present a low-complexity FP issue logic (MB_distr) that achieves high performance with small energy requirements. The MB_distr scheme classifies instructions and dispatches them into a set of queues depending on their data dependences. Instructions are selected for issue based on an estimation of when their operands will become available, so the conventional wakeup activity is not required. Additionally, the functional units are distributed across the different queues. The energy required by the proposed scheme is substantially lower than that of a conventional issue design, even if the latter can wake up only unready operands. The MB_distr scheme reduces the energy-delay product by 35% and the energy-delay-squared product by 18% with respect to a state-of-the-art approach.
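
    The dispatch-by-dependence and issue-by-predicted-availability idea can be sketched as a small behavioural model; the queue count, functional-unit latency and steering of independent instructions below are assumptions, not the paper's design.

```python
# Dependence-based dispatch with latency prediction instead of wakeup (sketch).
import heapq
from itertools import count

FP_LATENCY = 4  # assumed FP functional-unit latency

class DistributedIssue:
    def __init__(self, num_queues=4):
        self.queues = [[] for _ in range(num_queues)]  # heaps of (ready_cycle, seq, inst)
        self.ready_cycle = {}   # producer inst -> cycle its result is predicted ready
        self.queue_of = {}      # inst -> queue it was dispatched to
        self.seq = count()      # tie-breaker for the heaps
        self.rr = 0             # round-robin pointer for independent instructions

    def dispatch(self, inst, sources, cycle):
        known = [s for s in sources if s in self.ready_cycle]
        ready = max((self.ready_cycle[s] for s in known), default=cycle)
        if known:
            # Follow the producer that arrives last (the critical operand).
            q = self.queue_of[max(known, key=lambda s: self.ready_cycle[s])]
        else:
            q, self.rr = self.rr, (self.rr + 1) % len(self.queues)
        heapq.heappush(self.queues[q], (ready, next(self.seq), inst))
        self.ready_cycle[inst] = ready + FP_LATENCY
        self.queue_of[inst] = q

    def issue(self, cycle):
        # No broadcast/wakeup: just compare the predicted ready cycle with the clock.
        issued = []
        for q in self.queues:
            if q and q[0][0] <= cycle:
                issued.append(heapq.heappop(q)[2])
        return issued
```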

    Modelling probabilistic cache representativeness in the presence of arbitrary access patterns

    Measurement-Based Probabilistic Timing Analysis (MBPTA) is a promising, industry-friendly method to derive worst-case execution time (WCET) estimates as needed for critical real-time embedded systems. MBPTA performs several (R) runs of the program on the target platform, collecting the execution time of each run. MBPTA builds a probabilistic representativeness argument on whether events with high impact on execution time, such as cache misses, arise in the runs made at analysis time, so that their impact on execution time is captured. So far, only events occurring in cache memories have been shown to challenge the provision of such a representativeness argument. In this context, this paper introduces a representativeness validation method (RVS) to assess the probabilistic representativeness of MBPTA's execution time observations in terms of cache behaviour. RVS resorts to cache simulation to predict worst-case miss scenarios that can appear during the deployment phase. RVS also constructs a probabilistic worst-case miss-count curve based on the miss counts captured in the R runs. If that curve upper-bounds the impact of the predicted cache worst-case scenarios, R is deemed a sufficient number of runs from which probabilistic WCET (pWCET) estimates can be reliably derived. Otherwise, the user is requested to perform more runs until all cache scenarios of interest are captured.
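
    A heavily simplified sketch of the kind of check RVS performs is shown below; the probabilistic worst-case miss-count curve is approximated here by a plain empirical exceedance curve rather than the projection RVS actually builds, and the target exceedance probability is an assumed parameter.

```python
# Compare miss counts observed in R analysis-time runs against simulated worst-case
# miss scenarios; if every scenario is already captured with the target probability,
# R is deemed sufficient, otherwise more runs are requested (sketch only).

def exceedance_probability(observed_miss_counts, threshold):
    """Fraction of the R observed runs whose miss count reaches the threshold."""
    r = len(observed_miss_counts)
    return sum(m >= threshold for m in observed_miss_counts) / r

def enough_runs(observed_miss_counts, simulated_worst_cases, target_prob=1e-3):
    """True if every predicted worst-case miss scenario is captured by the runs."""
    return all(
        exceedance_probability(observed_miss_counts, wc) >= target_prob
        for wc in simulated_worst_cases
    )

# Example: miss counts from R = 10 runs vs. two simulated worst-case scenarios.
runs = [120, 135, 128, 142, 150, 138, 131, 147, 126, 144]
print(enough_runs(runs, simulated_worst_cases=[140, 155]))  # False: 155 never observed
```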

    High-Integrity Performance Monitoring Units in Automotive Chips for Reliable Timing V&V

    As software continues to control more system-critical functions in cars, its timing is becoming an integral element of functional safety. Timing validation and verification (V&V) assesses software's end-to-end timing measurements against given budgets. The advent of multicore processors with massive resource sharing reduces the significance of end-to-end execution times for timing V&V and requires reasoning on (worst-case) access delays on contention-prone hardware resources. While Performance Monitoring Units (PMUs) support this finer-grained reasoning, their design has never been a prime consideration in high-performance processors, from which automotive-chip PMU implementations descend, since the PMU does not directly affect performance or reliability. Given the PMU's instrumental importance for timing V&V, we advocate for PMUs in automotive chips that explicitly track activities related to worst-case (rather than average) software behavior, are recognized as an ISO 26262 mandatory high-integrity hardware service, and are accompanied by detailed documentation that enables their effective use to derive reliable timing estimates. This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant TIN2015-65316-P and the HiPEAC Network of Excellence. Jaume Abella has been partially supported by the MINECO under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717. Enrico Mezzetti has been partially supported by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva-Incorporación postdoctoral fellowship number IJCI-2016-27396.

    Execution time distributions in embedded safety-critical systems using extreme value theory

    Several techniques have been proposed to upper-bound the worst-case execution time behaviour of programs in the domain of critical real-time embedded systems. These computing systems have strong requirements regarding the guarantees that the longest execution time a program can take is bounded. Some of those techniques use extreme value theory (EVT) as their main prediction method. In this paper, EVT is used to estimate a high quantile for the different types of execution time distributions observed for a set of representative programs for the analysis of automotive applications. A major challenge appears when the dataset seems to be heavy tailed, because this contradicts the assumptions usually made for embedded safety-critical systems. A methodology based on the coefficient of variation is introduced for a threshold selection algorithm that determines the point above which the distribution can be modelled by a generalised Pareto distribution. This methodology also provides an estimate of the extreme value index and of high quantiles. We have applied these methods to execution time observations collected from the execution of 16 representative automotive benchmarks to predict an upper bound on the maximum execution time of each program. Several comparisons with alternative approaches are discussed. The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under the PROXIMA Project (grant agreement 611085). This study was also partially supported by the Spanish Ministry of Science and Innovation under grants MTM2012-31118 (2013-2015) and TIN2015-65316-P. Jaume Abella is partially supported by the Ministry of Economy and Competitiveness under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717.
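
    The peaks-over-threshold workflow described above can be reproduced on synthetic data with standard tools. The snippet below (using scipy, with synthetic "execution times", an assumed CV tolerance and an assumed exceedance probability, not the paper's benchmarks or exact methodology) selects a threshold via the coefficient of variation, fits a generalised Pareto distribution and estimates a high quantile.

```python
# Coefficient-of-variation threshold selection + GPD tail fit (sketch on synthetic data).
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)
exec_times = 1000 + rng.gamma(shape=5.0, scale=20.0, size=5000)  # synthetic "execution times"

def cv_of_exceedances(data, u):
    exc = data[data > u] - u
    return exc.std(ddof=1) / exc.mean()

# Scan candidate thresholds (empirical quantiles) and keep the lowest whose
# exceedance CV is close to 1 (the exponential-tail reference value).
candidates = np.quantile(exec_times, np.linspace(0.80, 0.99, 20))
threshold = min((u for u in candidates if abs(cv_of_exceedances(exec_times, u) - 1.0) < 0.1),
                default=candidates[-1])

exceedances = exec_times[exec_times > threshold] - threshold
xi, _, sigma = genpareto.fit(exceedances, floc=0)   # xi: extreme value index estimate
rate = len(exceedances) / len(exec_times)           # probability of exceeding the threshold

# High quantile (assumed exceedance probability 1e-6) from the fitted tail.
p_exceed = 1e-6
quantile = threshold + genpareto.ppf(1 - p_exceed / rate, xi, loc=0, scale=sigma)
print(f"threshold={threshold:.1f}, xi={xi:.3f}, quantile@{p_exceed:.0e}={quantile:.1f}")
```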

    Resilient random modulo cache memories for probabilistically-analyzable real-time systems

    Fault tolerance has often been assessed separately in safety-related real-time systems, which may lead to inefficient solutions. Recently, Measurement-Based Probabilistic Timing Analysis (MBPTA) has been proposed to estimate the Worst-Case Execution Time (WCET) on high-performance hardware. The intrinsic probabilistic nature of MBPTA-compliant hardware matches perfectly the random nature of hardware faults. Joint WCET analysis and reliability assessment has been carried out so far for some MBPTA-compliant designs, but not for the most promising cache design: random modulo. In this paper we perform, for the first time, an assessment of the aging robustness of random modulo and propose new implementations that preserve its key properties, namely low critical-path impact, low miss rates and MBPTA compliance, while enhancing reliability against aging by achieving a better, yet still random, activity distribution across cache sets.
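
    A simplified illustration of random-modulo-style placement follows; the cache geometry and the hash are assumptions chosen to show the two properties that matter here (consecutive lines do not conflict, while the mapping is randomised per execution), not the exact circuit evaluated in the paper.

```python
# Random-modulo-style set indexing (sketch): a per-execution random offset derived
# from the seed and the tag bits is added to the conventional index, so lines that
# share a tag still map to distinct sets while placement changes randomly across
# executions, which is what MBPTA needs and what spreads activity (and aging) over sets.
import random

NUM_SETS = 64      # assumed cache geometry
LINE_SIZE = 64     # bytes per line (assumed)

def random_modulo_index(addr, seed):
    line = addr // LINE_SIZE
    index_bits = line % NUM_SETS        # conventional modulo placement
    tag_bits = line // NUM_SETS
    offset = hash((seed, tag_bits)) % NUM_SETS   # depends on the tag, not the index
    return (index_bits + offset) % NUM_SETS

seed = random.getrandbits(32)           # new random seed each execution/reset
sets = [random_modulo_index(a, seed) for a in range(0, NUM_SETS * LINE_SIZE, LINE_SIZE)]
assert len(set(sets)) == NUM_SETS       # 64 consecutive lines still map to 64 distinct sets
```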

    Penelope: The NBTI-aware processor

    Transistors consist of fewer atoms with every technology generation. Such atoms may be displaced due to the stress caused by high temperature, frequency and current, leading to failures. NBTI (negative bias temperature instability) is one of the most important sources of failure affecting transistors. NBTI degrades PMOS transistors whenever the voltage at their gate is negative (logic input '0').

    On the tailoring of CAST-32A certification guidance to real COTS multicore architectures

    The use of Commercial Off-The-Shelf (COTS) multicores in the real-time industry is on the rise due to multicores' potential for performance increase and energy reduction. Yet, the unpredictable timing impact of contention in shared hardware resources challenges certification. Furthermore, most safety certification standards target single-core architectures and do not provide explicit guidance for multicore processors. Recently, however, CAST-32A has been released, providing guidance for software planning, development and verification on multicores. In this paper, at the theoretical level, we provide a detailed review of the CAST-32A objectives and the difficulty of reaching them under current COTS multicore design trends; at the experimental level, we assess the difficulties of applying CAST-32A to a real multicore processor, the NXP P4080. This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant TIN2015-65316-P and the HiPEAC Network of Excellence. Jaume Abella has been partially supported by the MINECO under Ramon y Cajal grant RYC-2013-14717.

    Modeling RTL Fault Models Behavior to Increase the Confidence on TSIM-based Fault Injection

    Future high-performance safety-relevant applications require microcontrollers delivering higher performance than the existing certified ones. However, means for assessing their dependability are needed so that they can be certified against safety-critical certification standards (e.g., ISO 26262). Dependability assessment analyses performed at a high level of abstraction inject single faults to investigate their effects on the system. In this work we show that single faults do not give the whole picture, due to fault multiplicities and reactivations. We then show that, by injecting complex fault models that consider multiplicities and reactivations at higher levels of abstraction, the results are substantially different, indicating that a change in the methodology is needed. The research leading to these results has received funding from the Ministry of Science and Technology of Spain under contract TIN2015-65316-P and the HiPEAC Network of Excellence. Carles Hernández is jointly funded by the Spanish Ministry of Economy and Competitiveness (MINECO) and FEDER funds through grant TIN2014-60404-JIN. Jaume Abella has been partially supported by the MINECO under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717.
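
    The contrast between the classic single-fault model and a model covering multiplicities and reactivations can be sketched as a generic injection campaign. The Simulator interface below is a hypothetical placeholder (not TSIM's API), and register width and campaign parameters are assumptions for illustration.

```python
# Single-bit single-activation injection vs. a complex fault model with multiplicity
# (several bits affected by one fault) and reactivation (the fault re-manifesting
# at several later cycles). A concrete simulator backend would implement Simulator.
import random

class Simulator:
    """Hypothetical instruction-set-simulator wrapper (placeholder, not TSIM)."""
    def run_until(self, cycle): ...
    def flip_bits(self, register, bit_positions): ...
    def run_to_completion(self): ...
    def outcome(self): ...          # e.g. 'masked', 'SDC', 'crash'

def inject_single_fault(sim, register, total_cycles, rng):
    sim.run_until(rng.randrange(total_cycles))
    sim.flip_bits(register, [rng.randrange(32)])     # one bit, one activation
    sim.run_to_completion()
    return sim.outcome()

def inject_complex_fault(sim, register, total_cycles, rng,
                         max_multiplicity=4, max_reactivations=3):
    bits = rng.sample(range(32), k=rng.randint(1, max_multiplicity))       # multiplicity
    activations = sorted(rng.sample(range(total_cycles),
                                    k=rng.randint(1, max_reactivations)))  # reactivations
    for cycle in activations:
        sim.run_until(cycle)
        sim.flip_bits(register, bits)
    sim.run_to_completion()
    return sim.outcome()

# Usage (with a real backend): inject_complex_fault(Simulator(), "r5", 100000, random.Random(42))
```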